Scale Zipline for large `data` a la Quantopian

I realize there's a Zipline Google Group, and also posts about running zipline backtests in parallel, but this question is different: how does Quantopian quickly load large amounts of data? When running minute backtests over several years with hundreds or thousands of symbols, `data` grows large, causing two problems I'm trying to solve (and which I believe Quantopian has already solved):

  1. How is this data quickly loaded when the backtest starts? Databases seem slow since this data can be gigabytes in size. Is the data memcached? If so, how is it broken up?
  2. How is this data shared between backtests? Quantopian essentially lets many backtests run in parallel. While they use different algos, they draw from the same data. Does each algo instance really get its own copy of historical data in memory, or is there something like memcache that each algo pulls its daily or minute data from as needed?

In a zipline scenario, if I want to test the same algo over the same time range with the same data, I could save a lot of time in the parallel tests if they could share the data object in memory. Even if I memcache historical data and then load it into a Pandas Panel for each test, that creates a complete copy of static data for each parallel test (i.e. one copy per CPU core, plus what's already in memcache!). That seems like a waste of RAM.

6 responses

Hi Jason,

On Quantopian, the OHLCV data are stored in bcolz ctables on disk, which allows for efficient querying while avoiding network I/O. The ctable structures are separate Python objects from the data instance that's passed to handle_data, but they are the ultimate source of the values returned when you slice into data.
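
If you want to experiment with the same idea yourself, a minimal sketch with plain bcolz might look like the following: write minute OHLCV bars to a compressed ctable on disk, then slice rows out of it at backtest start instead of querying a database. The column names and one-file-per-symbol layout here are just for illustration, not our actual schema.

```python
import bcolz
import numpy as np
import pandas as pd

# Fake minute bars for one symbol over one trading day (390 minutes).
minutes = pd.date_range('2016-01-04 09:31', periods=390, freq='T')
bars = pd.DataFrame({
    'open':   np.random.uniform(99, 101, len(minutes)),
    'high':   np.random.uniform(100, 102, len(minutes)),
    'low':    np.random.uniform(98, 100, len(minutes)),
    'close':  np.random.uniform(99, 101, len(minutes)),
    'volume': np.random.randint(1000, 10000, len(minutes)),
})

# Write the bars to a chunked, compressed ctable rooted at a directory on disk.
# Timestamps aren't stored per row; with a known trading calendar they can be
# recovered from the row offset.
bcolz.ctable.fromdataframe(bars, rootdir='AAPL_minute.bcolz', mode='w')

# Later, any backtest can reopen the ctable and slice just the rows it needs,
# which come back as a numpy structured array without loading the whole file.
ct = bcolz.open('AAPL_minute.bcolz')
first_hour = ct[0:60]            # first 60 minute bars
closes = ct.cols['close'][0:60]  # or a single column
```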

The zipline data source implementation currently used on Quantopian isn't open sourced yet, but here you can find the classes that integrate that same data format into pipeline.

-Rich


Thanks a ton @Rich!

To anyone reading this, I'm interested in cataloging unofficial libraries which allow:

  1. Alternate data sources for zipline (e.g. a SQL DB, bcolz, memcache, etc.)
  2. Alternate data generation for each simulation event (minute or day)

The end goal is to aid efficient clustered/parallel backtesting with zipline. The second item above is especially interesting because one could, for example, memcache the data for each sim event and have a cluster of zipline simulators pull from the same memcache for each event. This would eliminate the need to load caches before each test, so long as the algos being simulated use universes that are subsets of what's memcached.
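
To make the sim-event idea concrete, here's a rough sketch of the loader side, assuming a single process populates memcache with one key per (symbol, minute) before the cluster of simulators starts. The key scheme and the python-memcached client are my own choices, not anything standard in zipline.

```python
import memcache  # pip install python-memcached
import pandas as pd

mc = memcache.Client(['127.0.0.1:11211'])

def cache_minute_prices(symbol, minute_closes):
    """Push a symbol's minute close prices into memcache, one key per bar.

    minute_closes: pd.Series of close prices indexed by minute timestamps.
    """
    for ts, price in zip(minute_closes.index, minute_closes.values):
        key = '%s:%s' % (symbol, ts.strftime('%Y-%m-%dT%H:%M'))
        mc.set(key, float(price))

# Loader side: run once before the cluster of backtests starts.
minutes = pd.date_range('2016-01-04 09:31', periods=390, freq='T')
cache_minute_prices('AAPL', pd.Series(100.0, index=minutes))

# Any simulator in the cluster can now look up the bar for the current sim event.
price = mc.get('AAPL:2016-01-04T09:31')
```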

As a follow-up to anyone else interested in this, the easiest path forward is to create a custom DataSource. I used this fantastic example by cowmoo. I chose to memcache:

  1. my stock list, so that each algo can fetch it when it starts
  2. the price of each stock for each minute
  3. and then used the memcache client's .get() method inside the example's generator (~line 103, where randPrice is set) to set the price that gets yielded (see the sketch below).
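
Roughly, the change inside the generator looks like the sketch below. This is a stripped-down stand-in for the spot in cowmoo's example where randPrice is set; the surrounding DataSource plumbing is omitted, and the key scheme matches the loader sketch above.

```python
import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def minute_events(symbols, minutes):
    """Yield one event dict per (minute, symbol), priced from memcache."""
    for dt in minutes:
        for sid in symbols:
            key = '%s:%s' % (sid, dt.strftime('%Y-%m-%dT%H:%M'))
            price = mc.get(key)  # instead of randPrice
            if price is None:
                continue         # no bar cached for this minute
            yield {
                'dt': dt,
                'sid': sid,
                'price': price,
                'volume': 1000,  # placeholder volume for the sketch
            }
```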

Any code to share?

Sure! I just got it working today after some additional tweaks.
https://gist.github.com/hamx0r/ebaeab00c0039ccf07104bdc64bc072f

There are a couple of issues, though.

  1. When the algo starts running, it tries to run initialize (but not the initialize defined in my algo class) and fails, so I hacked the zipline code to skip initialization.
  2. But now orders never seem to get filled: I can see open orders, but my portfolio never changes.

I'm investigating and plan to post to the Zipline google group shortly.

Awesome. Will look into it.